Mac Mania 6

Text File | 1996-12-05 | 23.1 KB | 719 lines | [TEXT/KEEN]

# This program generates a set of HTML documents corresponding to a text document,
# one HTML doc for each chapter and a table of contents. Your text document must
# indicate formatting such as titles and lists using a specific set of structures
# such as dashed lines and bullets, and this structure must always be just before
# the text to be formatted (additional structure that ends the formatting may
# be placed after the text). Details below.

# If your document contains illustrations, you must have created the document with
# EnterAct 3.7 or later (which places all PICTs in the resource fork and numbers
# them 1000, 1001, 1002...1400). More exactly, if you have an old EnterAct document
# you must insert or delete at least one picture using EnterAct 3.7 or later, in
# order to force renumbering of the pictures. If you're not sure, use a resource
# editor to check the numbering of the PICTs in the doc's resource fork.

# An overview of the whole process, using "EnterAct 3 Manual" as an example:
# • get a copy of "clip2gif" by Yves Piguet (check your CDs, or Anarchie to Info-Mac)
# • structure your document ("EnterAct 3 Manual"), using EnterAct if you have PICT illustrations
#   (more on this important step below)
# • create a folder to hold the html version of your document, and within it
#   create a folder to hold the gifs ("Disk:...:E3M_HTML:Graphics:")
# • run the script "PICT rsrcs to numbered gifs" from within EnterAct: in the
#   first dialog, select the document to convert ("EnterAct 3 Manual"); in the second
#   dialog, select the folder to hold the gifs ("Graphics"). NOTE: to avoid being pestered
#   about where "clip2gif" is, you should open the script using Apple's "Script Editor" first,
#   and then after you have relocated "clip2gif" (you will be asked to do so by Script Editor)
#   do a Save.
# • when "PICT rsrcs to numbered gifs" is done, it presents you with a command
#   line to run this program: select it and hit <enter> to run it. You should also
#   save the command line(s) away somewhere for future reference. It looks like this:
#       hAWK -f$TextToHTML -vgifList="Disk:CW CEDAR:E3M_HTML:Graphics:Unsorted gif list"
#       -- "Disk:CW CEDAR:EnterAct Stuff:Documentation:EnterAct 3 Manual"
#   (it will be stored in the file "$tempScriptResult" until you run your next script with EnterAct)
# • when this program is done, there will be a "Contents.html" file at the top
#   of the folder that holds the html version of your document, and also
#   within it will be a "Text" folder holding the chapter documents, and
#   your folder for the gifs will be chock full of gifs.
# • drag the "Contents.html" file onto your favourite browser and verify that
#   things came out the way you wanted.

# If your document contains no illustrations then you don't need to run the
# "PICT rsrcs to numbered gifs" script first. But you still need to construct
# a command line to run this program. First create a folder to hold the results
# (eg "Disk:...:MyDoc HTML:"). Remember the full path names for this folder and for
# your source document ("MyDoc"). Your command line should then look like this:
#       hAWK -f$TextToHTML -vgifList="Disk:...:MyDoc HTML:Graphics:Unsorted gif list"
#       -- "Disk:...:MyDoc"
# Note that neither the "Graphics" folder nor the file "Unsorted gif list" will
# exist, but you still need to include them in the command line argument. This
# is a minor nuisance, but if you ever decide to include illustrations in your
# document then all you have to do is run the script "PICT rsrcs to numbered gifs", and
# then run the exact same command line that you produced above (or run the
# command line that the script produces, which will be exactly the same).

# The only tricky bit here is structuring your document so that this program can
# convert the structure into proper html formatting. Characters in your text have to
# clearly signal where headings are, where lists start, and so on. This program is
# set up to handle a specific fixed structure, and if your document doesn't follow
# this structure you'll have to change either your document or this program,
# whichever seems easier. Note however that in all cases these structures must be
# placed just before the text that is to be formatted, though additional structure
# can follow the text to signal the end of formatting.

# The "EnterAct 3 Manual" is an example of a document structured for use with
# this program: you might want to browse through it as you read the rules below.

# Here are the structuring rules that this program uses by default:
# • First line: presumed to be the title, skip it (the title is instead taken from
#   the name of the document)
# • Each chapter title should be between "dash-space" lines - - - - -,
#   that is one dash-space line before the chapter title, and another
#   dash-space line immediately after. The "dash-space" line should begin
#   with a dash '-', and the line should contain only dashes and spaces in
#   any mix you like after the starting dash, with at least one additional
#   character after the starting dash.
# • If you have a table of contents in your document, it should be preceded with
#   the chapter title "CONTENTS" (between dash-space lines). The entire table
#   of contents will be skipped, and regenerated by this program. Note that your
#   table of contents should be followed by a chapter title, since everything
#   from the "CONTENTS" chapter title to the next chapter title will be skipped.
# • Within each chapter, major subheadings that should be included in the main
#   table of contents should be formatted "§\tSub section name". Subheadings
#   that should NOT be shown in the main table of contents but SHOULD be shown
#   in the table of contents for the chapter itself should be formatted
#   "(§)\tSub section name".
# • Subheadings that should not be shown in any table of contents should take
#   the form ">\tSub section name".
# • Any illustrations (PICT) must have been inserted using EnterAct, and you must
#   have inserted or deleted at least one PICT using v3.7 or later of EnterAct.
# • Major lists begin with "\t•". Sublists begin with "\t\t•". There should be no
#   blank lines within a list, since a blank line signals the end of the list
#   (including any sublists).
# • Chunks of text that are to be quoted as-is with no reformatting should be
#   between lines that consist of exactly four underscores "____".

# If you wish to modify the structure that this program uses, you'll need to change
# the function "InitStructurePatterns()" below. Note that structuring information is expected
# to come BEFORE the text to be formatted (in a chapter title the dash-space line comes
# before and after the title, but the one after the title is just for the sake of
# appearance in the non-html version). If you want any structuring information to
# come AFTER the text (such as indicating a chapter title by just a dash-space line
# following the title, no dash-space line before it) then you will have to buffer
# the lines as they are read in from your document, since you won't know that
# formatting is required until you've looked at the line after the one that needs
# to be formatted. And you'll have to buffer the output too for the same reason.
# For some help handling this, see
#       «hAWK User’s Manual» «R 2 Beyond input records»
# especially the part a couple of pages in, about "End–buffered input".

# The function SkipUnwantedSections() looks for specific chapter titles,
# by default just "CONTENTS", and then skips ahead to the next chapter.
# To change which chapters are skipped, modify the first line in this function,
#       if ($0 ~ /^[ \t]*CONTENTS[ \t]*$/)
# to match the names of chapters that should be skipped.
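
# A small illustrative fragment that follows the default structuring rules.
# It is invented for this note (it is not taken from the "EnterAct 3 Manual"),
# and <tab> stands for a single tab character:
#
#       My Document
#       - - - - - - - - - -
#       Getting Started
#       - - - - - - - - - -
#       §<tab>Installing
#       Ordinary paragraph text.
#
#       (§)<tab>Custom installs
#       <tab>• first item
#       <tab><tab>• a sub-item
#       <tab>• second item
#
#       ><tab>A minor heading
#       ____
#       text quoted as-is
#       ____
#
# Assuming the default patterns, this should yield one chapter file named
# "Getting_Started.html" in the "Text" folder. "Installing" should appear in both
# the main and the chapter table of contents, "Custom installs" only in the
# chapter's own table of contents, and "A minor heading" in neither.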

# TOC and anchor handling:
# Given "Name of Heading" and heading level H2 or H3:
# 1 replace with <Hn><A NAME = "Name of Heading" >Name of Heading</A></Hn>
# 2 if H2 head, increment h2Counter
# 3 accumulate headings in TOC[h2Counter], with SUBSEP in between
# 4 at end split each array entry, output URLs in unordered lists for TOC.

BEGIN {
    InitSpecialCharacters();
    InitStructurePatterns();

    # Remember full name of main document.
    inputFile = ARGV[1];
    # Get title and main head from name of document.
    n = split(inputFile, names, ":");
    theMainTitle = names[n];
    theMainHead = names[n];
    contentsMarker = "TOC GOES HERE";
    inList = 0;
    listElement = 0;
    h2Counter = 0;
    newParagraphComing = 1;
    currentGIF = 0;
    numGIFs = 0;
    doingGIF = 0;
    doingAsIs = 0;

    # "gifList", full path to Unsorted gif list, should be preset
    # in dialog or on the command line.
    gifArrayFile = gifList;
    outFile = "";
    contentsFileName = "Contents.html";
    n = split(gifArrayFile, names, ":");
    for (i = 1; i < n - 1; ++i) {
        chaptersFolder = chaptersFolder names[i] ":";
    }
    graphicsFolderName = names[i];
    gifPartialLocation = "../" graphicsFolderName "/";    # eg "../Graphics/"
    contentsFileLocation = chaptersFolder;
    chaptersFolder = chaptersFolder "Text:";
    ## eg chaptersFolder = STDPATH "E3M_HTML:Text:"
    MakeFolder(chaptersFolder);
    # If doc has no illustrations, make the "Graphics" folder -- life is simpler.
    graphicsFolderPath = contentsFileLocation graphicsFolderName ":";
    MakeFolder(graphicsFolderPath);

    LoadGIFNames();

    # The main event: inhale all the lines, write the chapters, write main contents.
    DoTheLines();
    WriteMainContentsFile();

    # Notify we're done.
    print "HTML conversion of", theMainTitle, "complete.";
}

# Two backslashes are needed before a '&' to use it literally in a quoted pattern
# when doing substitution, otherwise it means "the entire pattern that was matched".
function InitSpecialCharacters() {
    ampersand = "\\&amp;";
    lessThan = "\\&lt;";
    greaterThan = "\\&gt;";
    quote = "\\&quot;";
    euroLeft = "\\&laquo;";
    euroRight = "\\&raquo;";
    bullet = "\\&middot;";
    dornk = "\\&not;";
    section = "\\&sect;";
    para = "\\&para;";
    cedillaC = "\\&ccedil;";
    shy = "\\&shy;";
    copyright = "\\&copy;";
    registration = "\\&reg;";
    question = "\\&#63;";
    spaceSub = "_";
}

# NOTE the metacharacters
#       \ ^ $ . [ ] | ( ) * + ? &
# must be preceded by TWO backslashes if you want to match them
# literally in a quoted regular expression. Also note the regular expression
# does NOT overlap the actual text in any case--this allows us to remove
# just the structure signals easily with sub(structure["thing"], "");
function InitStructurePatterns() {
    structure["preformatted"] = "^[ \t]*____[ \t]*$";
    structure["any_list"] = "^[\t]+•[ \t]*";
    structure["top_list"] = "^\t•";
    structure["sub_list"] = "^\t\t•";
    structure["chapter_title"] = "^-[ \t-]+$";
    structure["section_title_shown"] = "^§\t";
    structure["section_title_not_shown"] = "^\\(§\\)\t";    # NOTE the double backslashes
    structure["subsection_title"] = "^>\t";
    # the character between ^ and $ is <option><space> (a non-breaking space)
    structure["gif_marker"] = "^ $";
}

# do for all input lines
function DoTheLines() {
    getline < inputFile;    # skip the first line
    while ((getline < inputFile) > 0) {
        # Stop skipping blank lines if we were doing a GIF and hit nonblank
        if ($0 != "")
            doingGIF = 0;

        # "As is" sections shouldn't contain any other "structure"
        if ($0 ~ structure["preformatted"]) {
            doneFirstPRELine = 0;
            while ((getline < inputFile) > 0) {
                if ($0 ~ structure["preformatted"]) {
                    if (doneFirstPRELine == 1)
                        PrintToOutFile("</PRE>");
                    break;
                }
                else {
                    ReplaceSpecialCharacters();
                    if (doneFirstPRELine == 1)
                        PrintToOutFile($0);
                    else {
                        PrintToOutFile("<PRE>" $0);
                        doneFirstPRELine = 1;
                    }
                }
            }
        }
        # Remember if list element starting (one or two tabs, bullet)
        else if ($0 ~ structure["any_list"]) {
            if ($0 ~ structure["top_list"]) {
                # We may be starting a brand new list
                if (inList == 0) {
                    PrintToOutFile("<UL>");
                    inList = 1;
                }
                else {
                    # A new list element ends any subelement
                    if (subListElement == 1) {
                        PrintToOutFile("\t</UL>");
                        subListElement = 0;
                    }
                }
                listElement = 1;
                # Print the first line of the new list element
                sub(structure["any_list"], "");
                ReplaceSpecialCharacters();
                PrintToOutFile("<LI>" $0);
            }
            else if ($0 ~ structure["sub_list"]) {
                # First subelement, start a new list
                if (subListElement == 0) {
                    PrintToOutFile("\t<UL>");
                }
                subListElement = 1;
                # Print the first line of the new list subelement
                sub(structure["any_list"], "");
                ReplaceSpecialCharacters();
                PrintToOutFile("\t<LI>" $0);
            }
        }
        # A list finishes with a blank line (also signals new paragraph)
        else if ($0 == "") {
            if (inList) {
                if (subListElement == 1)
                    PrintToOutFile("\t</UL>");
                if (listElement == 1)
                    PrintToOutFile("</UL>");
                inList = 0;
                listElement = 0;
                subListElement = 0;
            }
            # Print blank lines, unless we're skipping GIF space
            if (doingGIF == 0)
                PrintToOutFile("");
            newParagraphComing = 1;
        }
        # Top level heading H2, between dashed lines
        # This heading starts a new file
        else if ($0 ~ structure["chapter_title"]) {
            getline < inputFile;
            # Skip the "CONTENTS" section
            SkipUnwantedSections();
            # --having trouble with quotes in chapter titles
            SimplifyQuotes();
            ReplaceSpecialCharacters();
            # make sure we can use text as a file name
            gsub(/\t/, " ");
            gsub(/:/, " ");
            StartNewChapter($0);
            TOC[++h2Counter] = $0 SUBSEP;
            # Skip following dashed line.
            getline < inputFile;
        }
        # Second level heading, shown in main contents
        else if ($0 ~ structure["section_title_shown"]) {
            sub(structure["section_title_shown"], "");
            ReplaceSpecialCharacters();
            TOC[h2Counter] = TOC[h2Counter] $0 SUBSEP;
            PrintHeading($0, "H3");
        }
        # Second level heading, NOT shown in main contents
        # -- name is preceded with "!" in TOC array to signal that.
        else if ($0 ~ structure["section_title_not_shown"]) {
            sub(structure["section_title_not_shown"], "");
            ReplaceSpecialCharacters();
            TOC[h2Counter] = TOC[h2Counter] "!" $0 SUBSEP;
            PrintHeading($0, "H3");
        }
        # Third level heading, not in any contents (although it does have an anchor)
        else if ($0 ~ structure["subsection_title"]) {
            sub(structure["subsection_title"], "");
            ReplaceSpecialCharacters();
            PrintHeading($0, "H4");
        }
        # GIF entry, <option><space> by itself on a line
        else if ($0 ~ structure["gif_marker"]) {
            PrintGIFTag(gifPartialLocation, gifName[++currentGIF]);
            doingGIF = 1;
        }
        # Regular line, just replace special characters and print it.
        else {
            ReplaceSpecialCharacters();
            if (newParagraphComing == 1)
                PrintToOutFile("<P>" $0);
            else
                PrintToOutFile($0);
            newParagraphComing = 0;
        }
    }
}

# Load AND sort the gif names, to gifName[ 1..numGIFs ].
# (gif name format is "arbtext#ddddd.gif" where d is a digit)
function LoadGIFNames(    x, p, a, b, numSpot, numA, trueP, i) {
    numGIFs = 0;
    numA = 0;
    if (exists(gifArrayFile)) {
        while ((getline x < gifArrayFile) > 0)
            p[++numGIFs] = x;
        for (i = 1; i <= numGIFs; ++i) {
            numSpot = match(p[i], /#[0-9]+/);
            # allow other files in folder, or other text in list
            if (numSpot > 0) {
                a[++numA] = substr(p[i], numSpot+1, RLENGTH-1);
                trueP[numA] = p[i];
            }
        }
        if (numA+0 > 0) {
            sort(a,b,"n");
            for (i = 1; i <= numA; ++i) {
                gifName[i] = trueP[b[i]];
            }
        }
        numGIFs = numA;
    }
}

# Print to specific file, if there is one.
function PrintToOutFile(s) {
    if (outFile != "")
        print s > outFile;
}

function ReplaceSpecialCharacters() {
    gsub(/\&/, ampersand);    # keep this one first
    gsub(/</, lessThan);
    gsub(/>/, greaterThan);
    gsub(/«/, euroLeft);
    gsub(/»/, euroRight);
    gsub(/•/, bullet);
    gsub(/¬/, dornk);    # does this have a real name?
    gsub(/§/, section);
    gsub(/¶/, para);
    gsub(/ç/, cedillaC);
    gsub(/—/, shy);
    gsub(/©/, copyright);
    gsub(/®/, registration);
    # Question mark gives trouble in anchors, do it too
    #gsub(/\?/, question);
    # Short dash -- where is it in the ISO Latin-1 set??
    gsub(/–/, "-");
    # Ditto "…"
    gsub(/…/, "...");
    # And hey, where's ƒ?
    gsub(/ƒ/, "f");
    # straighten out the quotes and ticks
    gsub(/“/, "\"");
    gsub(/”/, "\"");
    gsub(/‘/, "'");
    gsub(/’/, "'");
    # do quotes last since we may have generated some new ones
    gsub(/"/, quote);
}

function SimplifyQuotes() {
    gsub(/“/, "'");
    gsub(/”/, "'");
    gsub(/‘/, "'");
    gsub(/’/, "'");
    gsub(/"/, "'");
}

# Skip the "CONTENTS" section. This is called for the beginning of all chapters,
# so you could add a counter and selectively skip more than one chapter in different
# places throughout your document.
function SkipUnwantedSections() {
    if ($0 ~ /^[ \t]*CONTENTS[ \t]*$/) {
        getline < inputFile;
        while ((getline < inputFile) > 0) {
            if ($0 ~ structure["chapter_title"]) {
                getline < inputFile;
                break;
            }
        }
    }
}

# Finish writing current chapter, if any. Start a new temp file for
# next chapter (tack "x" onto chapter name for the temp version)
# and pump out the starting HTML.
function StartNewChapter(chapterName,    truncatedName, nameLength) {
    if (outFile != "")
        FinishCurrentChapter();
    truncatedName = TempFileNameForChapter(chapterName);
    outFile = chaptersFolder truncatedName;
    StartHTML(chapterName);
}

# Clean name, append .html, keep name short enuff.
function TempFileNameForChapter(chapterName,    fileName, nameLength) {
    fileName = chapterName ".html" "x";
    nameLength = length(fileName);
    if (nameLength > 31)
        fileName = substr(chapterName, 1, 25) ".html" "x";
    gsub(/ /, spaceSub, fileName);
    return fileName;
}

function FileNameForChapter(chapterName,    fileName, tempName, nameLength) {
    tempName = TempFileNameForChapter(chapterName);
    nameLength = length(tempName);
    fileName = substr(tempName, 1, nameLength - 1);
    return fileName;
}

# Close temp file for chapter; copy it to final version, inserting
# TOC at top; delete temp file.
function FinishCurrentChapter(    nameLength) {
    PrintToOutFile("<P>");
    DoChapterTOC();
    EndHTML();
    close(outFile);
    oldOutFile = outFile;
    nameLength = length(outFile);
    outFile = substr(outFile, 1, nameLength - 1);
    WriteFinalChapter();
    close(outFile);
    close(oldOutFile);
    remove(oldOutFile);
    # temporarily, we have no outFile to write to
    outFile = "";
}

function StartHTML(chapterName) {
    PrintToOutFile("<HTML>");
    PrintToOutFile("");
    PrintToOutFile("<HEAD>");
    PrintToOutFile("<TITLE>" chapterName "</TITLE>");
    PrintToOutFile("</HEAD>");
    PrintToOutFile("");
    PrintToOutFile("<BODY>");
    PrintToOutFile("<H1>" chapterName "</H1>");
    PrintToOutFile("<HR>");
    PrintToOutFile("");
    PrintToOutFile(contentsMarker);    # table of contents goes here on 2nd pass
}

function EndHTML() {
    PrintToOutFile("");
    PrintToOutFile("</BODY>");
    PrintToOutFile("");
    PrintToOutFile("</HTML>");
    PrintToOutFile("");
    PrintToOutFile("");
}

# Print a heading and named anchor. "level" should be "H2", "H3" etc.
# Having trouble with "?" in name, so leave it out of anchor name.
function PrintHeading(name, level) {
    PrintToOutFile("<" level "><A NAME = \"" NoQuestionVersionOf(name) "\" >" name "</A></" level ">");
}

# Remove question marks, and replace spaces.
function NoQuestionVersionOf(name) {
    gsub(/\?/, "", name);
    gsub(/ /, spaceSub, name);
    return name;
}

# All pictures are in ":Graphics:" beside ":Text:", and so "theLocation"
# says go one level up and then down into Graphics.
function PrintGIFTag(theLocation, theGIFName,    copyOfName) {
    PrintToOutFile("<P>");
    # One little wrinkle, turn the "#" in name into "%23"
    copyOfName = theGIFName;
    sub(/#/, "%23", copyOfName);
    PrintToOutFile("<IMG SRC=\"" theLocation copyOfName "\" ALIGN = \"top\">");
    PrintToOutFile("<P>");
}

# The main table of contents.
# We just print two levels. Additional levels would probably need additional
# counters h3Counter, h4Counter etc and additional arrays.
function DoTOC(    i, j, numSubHeads, contents, showSubs) {
    PrintToOutFile("<H2><A NAME = \"Table_of_Contents\">" " Table of Contents " "</A></H2>");
    PrintToOutFile("<UL>");
    for (i = 1; i <= h2Counter; ++i) {
        numSubHeads = split(TOC[i], contents, SUBSEP);
        if (contents[numSubHeads] == "")
            --numSubHeads;
        # Print the main heading, href is file corresponding to chapter
        PrintToOutFile("<LI> <A HREF = \"Text/" FileNameForChapter(contents[1]) "\"> " contents[1] " </A>");
        # then print the subheadings
        if (numSubHeads > 1) {
            # Check there are some headings to show - don't show if name starts with "!"
            showSubs = 0;
            for (j = 2; j <= numSubHeads; ++j) {
                if (index(contents[j], "!") != 1) {
                    showSubs = 1;
                    break;
                }
            }
            if (showSubs == 1) {
                PrintToOutFile("\t<UL>");
                for (j = 2; j <= numSubHeads; ++j) {
                    # href consists of location, file name (from chapter name), "#", subsection name
                    if (index(contents[j], "!") != 1)
                        PrintToOutFile("\t<LI> <A HREF = \"Text/" FileNameForChapter(contents[1]) "#" NoQuestionVersionOf(contents[j]) "\"> " contents[j] " </A>");
                }
                PrintToOutFile("\t</UL>");
            }
        }
    }
    PrintToOutFile("</UL>");
}

function DoChapterTOC(    i, j, numSubHeads, contents) {
    # Print link to main table of contents
    PrintToOutFile("<A HREF = \"../" contentsFileName "#Table_of_Contents\">Main Contents</A>");
    # Print chapter's table of contents
    i = h2Counter;
    numSubHeads = split(TOC[i], contents, SUBSEP);
    if (contents[numSubHeads] == "")
        --numSubHeads;
    # Print the main heading
    PrintToOutFile("<H2> <A HREF = \"#" NoQuestionVersionOf(contents[1]) "\"> " contents[1] " </A></H2>");
    if (numSubHeads > 1) {
        PrintToOutFile("\t<UL>");
        for (j = 2; j <= numSubHeads; ++j) {
            # Trim any leading "!"
            if (index(contents[j], "!") == 1)
                contents[j] = substr(contents[j], 2);
            PrintToOutFile("\t<LI> <A HREF = \"#" NoQuestionVersionOf(contents[j]) "\"> " contents[j] " </A>");
        }
        PrintToOutFile("\t</UL>");
    }
}

function WriteFinalChapter(    haveSeenContents) {
    haveSeenContents = 0;    # speed things up with a simple "boolean"
    # Get lines from oldOutFile to the variable line.
    while ((getline line < oldOutFile) > 0) {
        if (haveSeenContents == 0 && line ~ contentsMarker) {
            DoChapterTOC();
            PrintToOutFile("");
            haveSeenContents = 1;
        }
        else
            PrintToOutFile(line);
    }
}

# Write the main file, table of contents at one level above the chapter documents.
# Finish any open chapter first.
function WriteMainContentsFile() {
    if (outFile != "")
        FinishCurrentChapter();
    outFile = contentsFileLocation contentsFileName;
    PrintToOutFile("<HTML>");
    PrintToOutFile("");
    PrintToOutFile("<HEAD>");
    PrintToOutFile("<TITLE>" theMainTitle "</TITLE>");
    PrintToOutFile("</HEAD>");
    PrintToOutFile("");
    PrintToOutFile("<BODY>");
    PrintToOutFile("<H1>" theMainHead "</H1>");
    PrintToOutFile("<HR>");
    PrintToOutFile("");
    DoTOC();
    PrintToOutFile("");
    PrintToOutFile("</BODY>");
    PrintToOutFile("");
    PrintToOutFile("</HTML>");
    PrintToOutFile("");
    PrintToOutFile("");
    close(outFile);
}

# Working within the current bounds of hAWK, we can (just barely) persuade
# a folder to come into existence by using "copy", which creates folders
# along the specified path if possible. So we make a file, copy it to the
# folder we want to exist, and then remove both versions of the file. Ugh.
function MakeFolder(folderPathName,    xFile, xFileSource, xFileDest) {
    xFile = "Temp1342134HIKE!";
    xFileSource = STDPATH xFile;
    xFileDest = folderPathName xFile;
    print "Hello" > xFileSource;
    close(xFileSource);
    copy(xFileSource, xFileDest);
    remove(xFileSource);
    remove(xFileDest);
}
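
# A possible debugging aid, not part of the original program: when you change
# InitStructurePatterns() it can help to see which rule a given line will trigger.
# The name ClassifyLine is hypothetical. This sketch assumes InitStructurePatterns()
# has already run, uses only standard awk, and tests the patterns in the same
# order as DoTheLines().
function ClassifyLine(line) {
    if (line ~ structure["preformatted"]) return "preformatted";
    if (line ~ structure["any_list"]) {
        if (line ~ structure["top_list"]) return "top_list";
        if (line ~ structure["sub_list"]) return "sub_list";
        # three or more tabs before the bullet: DoTheLines() prints nothing for these
        return "unhandled_list";
    }
    if (line == "") return "blank";
    if (line ~ structure["chapter_title"]) return "chapter_title";
    if (line ~ structure["section_title_shown"]) return "section_title_shown";
    if (line ~ structure["section_title_not_shown"]) return "section_title_not_shown";
    if (line ~ structure["subsection_title"]) return "subsection_title";
    if (line ~ structure["gif_marker"]) return "gif_marker";
    return "regular";
}
# For example, a temporary statement such as
#       print ClassifyLine($0) ": " $0;
# at the top of the main loop in DoTheLines() would label each input line
# with the rule it matches, before any substitution has been applied.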